This project gave us the opportunity to showcase many of the skills we learned throughout this course by applying them to an investigation of datasets of our choosing. We narrowed our scope to a few datasets containing socioeconomic information, namely unemployment and crime data in NYC. We hoped that this investigation would reveal valuable information that could be used to formulate policy proposals.
We used the following workflow for each dataset:
[Figure: workflow chart]
We then merged the datasets to explore further and try to draw some final conclusions.
source("environment_setup.R", echo = TRUE, prompt.echo = "", spaced = FALSE)
## if (!require("dplyr")) install.packages("dplyr")
## if (!require("RSocrata")) install.packages("RSocrata")
## if (!require("tidyverse")) install.packages("tidyverse")
## if (!require("ggplot2")) install.packages("ggplot2")
## if (!require("readxl")) install.packages("readxl")
## if (!require("plyr")) install.packages("plyr")
## if (!require("treemap")) install.packages("treemap")
## if (!require("leaflet")) install.packages("leaflet")
## if (!require("forcats")) install.packages("forcats")
## if (!require("ggExtra")) install.packages("ggExtra")
## if (!require("GGally")) install.packages("GGally")
| variable | description |
|---|---|
| arrest_date | Exact date of arrest for the reported event. |
| ofns_desc | Description of internal classification corresponding with KY code (more general category than PD description). |
| arrest_boro | Borough of arrest: B (Bronx), S (Staten Island), K (Brooklyn), M (Manhattan), Q (Queens). |
| age_group | Perpetrator’s age within a category. |
| perp_sex | Perpetrator’s sex description. |
| perp_race | Perpetrator’s race description. |
| x_coord_cd | Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104). |
| y_coord_cd | Midblock Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104). |
| latitude | Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326). |
| longitude | Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326). |
Load the data into R using the RSocrata API.
source("arrests_dataset.R", echo = FALSE, prompt.echo = "", spaced = FALSE)
head(arrests_df, 10)
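The chunks that follow group by a `year` column that is not listed in the variable table. It is presumably created inside `arrests_dataset.R`, which is not shown here. As a minimal sketch, assuming `arrest_date` can be parsed as an ISO date, the year could be derived as below. The `toy` data frame and its dates are made up purely for illustration.

```r
# Toy rows with made-up dates, only to show the transformation.
toy <- data.frame(
  arrest_date = c("2014-03-01", "2018-12-31"),
  stringsAsFactors = FALSE
)

# Parse the date string and keep the four-digit year as an integer.
toy$year <- as.integer(format(as.Date(toy$arrest_date), "%Y"))
toy
```

If the real column arrives in another format (for example with a timestamp), `as.Date()` would need a matching `format` argument.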
Rename the borough letters to proper names.
arrests_df$arrest_boro <- revalue(arrests_df$arrest_boro, c("Q"="Queens", "K"="Brooklyn", "M"="Manhattan", "S"="Staten Island", "B" = "Bronx"))
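A recode map can silently skip a code that it does not list, so it may be worth checking the mapping before relying on it. The following is a minimal base R sketch of a lookup that stops when it meets an unmapped code. `boro_names` and `recode_boro` are hypothetical names introduced here, not part of the project scripts.

```r
# Hypothetical lookup table: borough letter -> full name.
boro_names <- c(Q = "Queens", K = "Brooklyn", M = "Manhattan",
                S = "Staten Island", B = "Bronx")

# Hypothetical helper: translate codes and fail loudly on unknown ones.
recode_boro <- function(codes) {
  codes <- as.character(codes)
  unknown <- setdiff(unique(codes[!is.na(codes)]), names(boro_names))
  if (length(unknown) > 0) {
    stop("Unmapped borough code(s): ", paste(unknown, collapse = ", "))
  }
  # Index the named vector by code; NA codes stay NA.
  unname(boro_names[codes])
}

recode_boro(c("Q", "M", "B"))
```

Failing on unknown codes is a design choice. A gentler variant could return the original code unchanged and emit a warning instead.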
Remove missing values where no offense description is recorded.
arrests_df <- arrests_df %>% filter(ofns_desc != "")
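Before dropping rows, it can help to count how many are affected and to distinguish blank strings from `NA`, since `filter(ofns_desc != "")` removes both. A small base R sketch on made-up rows follows. The `toy` data frame is an assumption for illustration only.

```r
# Made-up rows: one blank description and one missing value.
toy <- data.frame(
  ofns_desc = c("ROBBERY", "", NA, "BURGLARY"),
  stringsAsFactors = FALSE
)

# Count blanks and NAs separately so the size of the loss is known.
n_blank <- sum(toy$ofns_desc == "", na.rm = TRUE)
n_na <- sum(is.na(toy$ofns_desc))

# Base R equivalent of the filter: keep rows with a non-empty description.
kept <- toy[!is.na(toy$ofns_desc) & toy$ofns_desc != "", , drop = FALSE]
nrow(kept)
```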
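# Count arrests per borough, year, and race.
# Note: no offense filter is applied, so despite the name this counts all arrests, not only murders.
murder_counts <- arrests_df %>%
group_by(arrest_boro, year, perp_race) %>%
dplyr::summarise(murder_counts = n()) %>%
arrange(desc(year))
murder_counts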
Let’s study the evolution of crime over the period of interest (2014-2018).
grouped_boro <- arrests_df %>%
group_by(year, arrest_boro) %>%
dplyr::summarize(count = n()) %>%
arrange(desc(count))
The plot below shows overall crime decreasing in every NYC borough. The pattern is very similar from year to year, with the counts appearing simply to scale down over time.
Surprisingly, total crime in Manhattan and Brooklyn is at fairly similar levels. Total crime is aggregated without accounting for different types of crime, so we continue the investigation by breaking crime down by type within each borough.
ggplot(grouped_boro, aes(x = factor(year), y = count, fill = arrest_boro)) +
geom_bar(stat = 'identity', position = position_dodge()) +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), breaks = seq(0,120000,10000)) +
xlab("year") + ylab("total crime") + ggtitle("Crime by Borough Time Series") +
scale_fill_brewer(palette="Blues") + theme_minimal()
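The observation that the counts appear to scale down over time could be checked numerically by computing the year-over-year percent change within each borough. The sketch below uses base R on made-up counts. The `toy` values and the borough labels "A" and "B" are placeholders, not results from the arrests data.

```r
# Made-up counts for two placeholder boroughs over three years.
toy <- data.frame(
  year = rep(2014:2016, times = 2),
  arrest_boro = rep(c("A", "B"), each = 3),
  count = c(100, 90, 81, 200, 180, 162)
)

# Sort so that differences are taken in chronological order per borough.
toy <- toy[order(toy$arrest_boro, toy$year), ]

# Percent change relative to the previous year; the first year has no baseline.
toy$pct_change <- ave(
  toy$count, toy$arrest_boro,
  FUN = function(x) c(NA, diff(x) / head(x, -1) * 100)
)
toy
```

If the boroughs really do shrink proportionally, their `pct_change` values for a given year should be close to one another.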
grouped_offenses <- arrests_df %>%
group_by(year, arrest_boro, ofns_desc) %>%
dplyr::summarize(count = n())
t5 <- grouped_offenses %>% top_n(5)
## Selecting by count
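Because `summarize()` removes only the last grouping level, `grouped_offenses` is still grouped by `year` and `arrest_boro`, so `top_n(5)` keeps the five most frequent offenses within each year and borough pair rather than five rows overall. The base R sketch below shows the same per-group selection on made-up counts. `top_k_per_group`, `grp`, and the `toy` values are hypothetical names and data used only for illustration.

```r
# Made-up counts for two groups.
toy <- data.frame(
  grp = c("a", "a", "a", "b", "b", "b"),
  offense = c("x", "y", "z", "x", "y", "z"),
  count = c(5, 9, 1, 4, 2, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the k rows with the largest count in each group.
top_k_per_group <- function(df, k) {
  pieces <- lapply(split(df, df$grp), function(g) {
    head(g[order(-g$count), ], k)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

top2 <- top_k_per_group(toy, 2)
top2
```

One difference to keep in mind: `top_n()` keeps all tied rows, whereas this sketch cuts off at exactly `k` rows per group.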
ggplot(t5, aes(x = reorder(arrest_boro, -count), y = count, fill=ofns_desc)) +
geom_bar(stat = 'identity', position = position_dodge()) +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), breaks = seq(0,80000,5000)) +
xlab("borough") + ylab("count") + ggtitle("Most Common Crimes by Borough 2014-2018") +
scale_fill_brewer(palette="Blues") + theme_minimal()
Perhaps we can comment on effective crime measures by looking at the least common crimes.
# Sum counts across years and boroughs first, so the 25 least common offenses are chosen overall.
least_common <- grouped_offenses %>%
group_by(ofns_desc) %>%
dplyr::summarize(count = sum(count)) %>%
top_n(-25, count)
ggplot(least_common, aes(x = reorder(ofns_desc, -count), y = count)) +
geom_bar(stat = 'identity', fill= 'lightblue') +
coord_flip() +
xlab("offense") + ylab("count") + ggtitle("Least Common Crimes 2014-2018") +
theme(axis.text.x = element_text(size=10), axis.text.y = element_text(size=8))
drugs <- arrests_df %>%
filter(ofns_desc == 'DANGEROUS DRUGS') %>%
group_by(year, arrest_boro) %>%
dplyr::summarize(count = n())
ggplot(drugs, aes(x = year, y = count, color = arrest_boro, group = arrest_boro)) +
geom_line() +
xlab("year") + ylab("count") + ggtitle("Dangerous Drugs Crime by Borough") +
theme(axis.text.x = element_text(size=10), axis.text.y = element_text(size=8))
ggplot(arrests_df, aes(x = age_group, fill = perp_sex)) +
geom_bar(position = position_dodge()) +
scale_fill_brewer(palette="Blues") +
xlab("age group") + ylab("count") + ggtitle("Perpetrator Age Group and Gender Distribution") +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), breaks = seq(0,7000000,100000))
grouped_arrests <- arrests_df %>% group_by(year, arrest_boro, ofns_desc, age_group, perp_sex, perp_race) %>%
dplyr::summarize(count = n())